ABSTRACT


This paper is a study of data base organization and
indexing methods with emphasis on the evaluation process.
The approach of the study is focused on the data structure,
a major characteristic of data base. Other aspects of the
subject, indexing, for example, were discussed in relation
to data structure. It was found that completeness and
lucidity of knowledge of data base organization and indexing
methods is necessary if one is to do a good job of evaluating
a data base system.
I. INTRODUCTION


The subject of data base organization is receiving increased
attention in the computer community. This is the case because
data bases are used in almost all types of computer
applications, e.g., business and scientific applications. The
computer community's involvement in data base studies has been
limited to specific applications. Recently, it was realized
that application independent data base systems are important.
Generalized data base system is the name used for this type
of system.

Various groups are making efforts to organize the study
of data base systems. Foremost in the field is the CODASYL¹
Systems Committee [3, 4]. This committee has as its goal the
development of specifications for a common language and
functions. Concurrent with the initial publication of the
CODASYL Systems Committee's report, arguments were raised as
to the advantages and disadvantages of generalized data base
systems as compared with the traditional approach of
application dependent data base systems. To fully understand
the different arguments in favor of or against an approach,
one needs to delve deeper into the subject of data systems.
This paper is an attempt to put into perspective the multiple
and diversified data base system developments.


¹CODASYL is the abbreviation of Conference on Data Systems
Languages.





II. CLASSIFICATIONS OF DATA BASE MANAGEMENT SYSTEMS 


Data base management systems have various characteristics.
The CODASYL Systems Committee in its technical report dated
May 1969 gave the following major data base characteristics:
data structure class, storage structure class, generalized
processes provided, language type, language form, modes of
use, file media, hardware environment and operating systems
environment. Almost all of these major characteristics have
subclassifications, e.g., under generalized processes
provided are file definition, file creation, file updating
and interrogation.

The ensuing discussion will deal with the more important
of the data organization characteristics. The system
evaluation process will also be discussed as much as possible.

A number of these characteristics are closely interrelated.
An illustration is the file media option. One would keep in
mind physical storage characteristics when selecting a data
organization. Also, the hardware and operating systems
environment must be considered. Sometimes these facilities
are fixed beforehand. In some cases, the computer system is
selected in addition to the data base system.


A. DATA STRUCTURES

Data structure is the logical or hierarchical relationships
of data elements in the data base. In contrast, storage
structure is the organization of data elements on the
physical storage system. And since data relationships usually
exist among records, data structure can further be viewed as
the relationships among records. Based on this viewpoint,
data structures are classified according to how the records
are cross-referenced. There are three general classifications
of data structures: (1) Multiple record file: In this
classification no record "owns" any record. Records are
related to each other only in their physical positions; hence,
their relationships can be defined only by logical keys. For
example, for records stored sequentially, a record's key may
be smaller or greater than another record's key, depending
on its position relative to the other record. (2) Hierarchical
file: This type of data structure is sometimes called
a tree structure because of the schematic representation of
the data relationships. In this classification a record can
"own" any number of records but it can be "owned" only by one
record. For example, in Figure 1, record D "owns" records
I and J and is "owned" by record A only. (3) Network
structure: In this classification a record can "own" or be
"owned" by any number of records. For example, in Figure 2,
record D "owns" records I and J and is "owned" by both record
A and another record [London, 1973].
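The three classifications can be told apart by counting how many records "own" each record. The following is a minimal sketch of that rule, not a method given in the source: the function name `classify`, the dictionary layout and the record names (including the second owner "B" in the network example) are all hypothetical.

```python
from collections import defaultdict

def classify(owns):
    """Classify an ownership map {owner: [owned records]}.

    Sketch only: it counts owners per record and does not check
    for cycles or for more than one root.
    """
    owner_count = defaultdict(int)
    for owned_records in owns.values():
        for rec in owned_records:
            owner_count[rec] += 1
    if not owner_count:
        return "multiple record file"   # no record owns any record
    if all(n == 1 for n in owner_count.values()):
        return "hierarchical"           # each record has exactly one owner
    return "network"                    # some record has several owners

# Hypothetical record names, loosely following Figures 1 and 2.
tree = {"A": ["B", "C", "D"], "D": ["I", "J"]}
network = {"A": ["D"], "B": ["D"], "D": ["I", "J"]}

print(classify(tree), "/", classify(network))
```

In the tree every record has a single owner; in the network record D has two owners, which is the only difference between the two examples.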

With the advent of direct access storage technology,
multiple record files are not as numerous as they once were.
Thus, to a great extent, tree and network data structures are
used in relating data within a file or from one file to
another. Furthermore, these data structures are found to be
more efficient for many data base applications. Chain (or the
chain-like ring) structure and list structure are two main
methods used in cross-referencing data in tree and network
data structures.


NETWORK DATA STRUCTURE


Figure 2

1. Chain Data Structures

Chaining is a technique of linking logically-related
records by means of pointers; pointers are special fields
incorporated into each record structure which contain
reference(s) to the next logically-related record(s). A group
of records so structured and linked is called a chain.
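As an illustration of the definition above, the following sketch stores a pointer field in each record and follows it to retrieve a chain in logical order. It is a minimal example under assumed names: `Record`, `next_in_chain`, `walk_chain`, the chain name "expiry" and the record keys are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Record:
    key: int
    data: str
    # One pointer field per chain the record belongs to:
    # chain name -> key of the next logically-related record (None ends the chain).
    next_in_chain: Dict[str, Optional[int]] = field(default_factory=dict)

def walk_chain(records: Dict[int, Record], chain: str, head: int) -> List[int]:
    """Follow the pointers of one chain, starting at the record `head`."""
    keys = []
    current: Optional[int] = head
    while current is not None:
        keys.append(current)
        current = records[current].next_in_chain.get(chain)
    return keys

# Records are held in key order (the physical sequence), while the
# hypothetical "expiry" chain links them in a different logical sequence.
records_file = {
    101: Record(101, "policy 101", {"expiry": 103}),
    102: Record(102, "policy 102", {"expiry": None}),
    103: Record(103, "policy 103", {"expiry": 102}),
}

print(walk_chain(records_file, "expiry", 101))
```

A record that belongs to several chains would simply carry one entry per chain in `next_in_chain`, which is how access by several keys arises.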

Chaining has a number of advantages [London, 1973].
Records in chains can be accessed by several keys; the number
of keys depends on the number of independent chains a record
belongs to. The use of multiple record keys allows access
to records. It also allows records to be related in a number
of ways. Further, it allows data retrieval in a logical
sequence regardless of the physical sequence of records.
Another advantage of chaining comes to light when additional
reports are required after the system has been installed.
Chaining allows users to produce newly required reports
without necessarily restructuring the file.

Of the disadvantages [London, 1973], the most critical
is the cumbersome updating procedure which requires
considerable processing time. This disadvantage, however,
can be decreased to a reasonable level by planning the
system updating. Two main factors have to be watched in
regard to system updating: the frequency of additions and
deletions of records in the chains and the frequency of data
changes within the established chains. (The former is usually
referred to as the volatility of data.) Other disadvantages
of chaining are record growth and the length of chains.
Search time to retrieve items from the chains is directly
proportional to the size of the records and the length of the
chains. Thus, although this disadvantage is less critical
than the former, it is by no means unimportant.
2. List Data Structures

The list data structure is comparable to chaining and
is conceptually an extension of chaining. However, instead
of using pointers to link records, the list data structure
technique uses lists, such that each list of addresses (or
records) contains logically related items only. Lists are
used for the primary purpose of reducing the length of search
time (the second disadvantage of chaining). A type of list
is the simple list.

A simple list approach is used in conjunction with
chains of records. If chain length is to be restricted or
if the search time is to be reduced, the chains are divided
into a number of segments, each of which owns an entry point.
A table of these entry points is the list. The entry points
are used as the primary keys in searching the file. Actual
records can be used as entry points. Variations are possible
in building lists. The primary keys may be made to point to
more than one entry point to a chain if the chain is long.
On the other hand, the number of primary keys can be reduced
if the chains are relatively short.
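The entry-point table described above can be sketched as a small index over a file whose records are grouped into segments. This is an illustrative sketch only: the policy numbers, the segment boundaries and the names `entry_keys`, `entry_points` and `find` are assumptions made for the example.

```python
import bisect

# File of (policy number, data) records, stored in policy-number order.
records = [(1004, "a"), (1250, "b"), (1980, "c"),
           (2001, "d"), (2500, "e"), (3100, "f")]

# The "list": one entry point per segment. entry_keys[i] is the primary
# key of segment i and entry_points[i] is the position of its first record.
entry_keys = [1000, 2000, 3000]
entry_points = [0, 3, 5]

def find(policy_no):
    """Look up a record by searching only the segment its key falls into."""
    i = bisect.bisect_right(entry_keys, policy_no) - 1
    if i < 0:
        return None                      # key precedes the first entry point
    start = entry_points[i]
    end = entry_points[i + 1] if i + 1 < len(entry_points) else len(records)
    for pos in range(start, end):        # the search is limited to one segment
        if records[pos][0] == policy_no:
            return records[pos][1]
    return None

print(find(2500))
```

Adding entry points shortens each segment and therefore the search, at the price of a larger table; removing them does the opposite, which is the trade-off the paragraph describes.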

Figures 3, 4, 5 and 6 are diagrams of a simple list
and its variations. It is assumed that a storage drum would
be used. The categorization of records per specific
characteristic is termed an attribute. Further, in these
figures it is assumed that the primary index would be composed
of insurance policy numbers of car owners. Chains A, B, C, D,
and E are used to designate each thousand series of policy
numbers.

In Figure 3 the number of records is on the average
just enough to occupy one track of the storage drum. This
facilitates the storage scheme, since the records of each
chain could be stored on one track each.

If the average number of records related to each
attribute is large, it would be necessary to store the records
on more than one track. In this situation each primary index
entry can be made to point to more than one track. This is
what is done in Figure 4. The index entry of 1000 is used
to point to track #1 and track #2. Also, the primary index
entry of 2000 is used to point to track #3 and track #4. Note
that pointers are needed from the last entries of track #1 and
track #3 to the first entries of track #2 and track #4,
respectively.

Figure 5 is another list variation. The number of
records per attribute is small and can be contained on some
portion of each track. Records related to other attribute(s)
could still be stored in the unused space of the tracks to
fully utilize them. Hence in Figure 5, the pointers could
be made to point to every other attribute (1000, 3000, 5000,
7000).

Assume that we want another list (secondary index)
that would be composed of insurance expiration dates. Since
the records are made contiguous on the basis of policy numbers,
we need pointers to follow the sequence of insurance expiration
dates. This is shown in Figure 6. Since there would be two
indices, the primary and the secondary indices, the scheme in
Figure 6 is a second degree inversion.

Since lists are extensions of chaining, they have the
same disadvantages, except that the extended search time in
chains is eliminated. The search capability of lists is of
considerable use. Much of the searching is done in the list,
thus unnecessary retrieval of records is avoided.

The degree of search effectiveness of a list is directly
proportional to the degree of file inversion. Inversion of
files and the inverted list will be discussed later. However,
the simple list requires more storage space than chaining.
Depending upon the application, the reduced search time may
well be worth the additional space.

The extreme of list data structuring is the inverted
list. This approach requires one entry for each type of
attribute in the file; hence, the lists would hold the
references between records. Conceptually, the inverted list
is a series of lists of pointers to data records. The lists
can be held at the beginning of the file or at another
location. The inverted list's record insertion and deletion
procedure is the same as that of a chain; i.e., record
insertion and deletion is done in the file area. However, the
difference is that all changes to the status and linkages
between records are done in the independent list area. The
inverted list provides a very flexible response to user
retrieval requests. This type of list provides for data
retrieval on the basis of variable parameters. To illustrate
this capability, assume that conditions A, B, C and D are the
variable parameters to be used as the basis for record
retrieval; i.e., only records that meet these conditions are
to be retrieved. The file to be maintained for this example
would include lists maintained for each condition. Upon
request for the record(s) that meet(s) the given conditions,
the different condition lists would be searched and their
intersection would be the required record(s). Processing time
to meet this kind of requirement would be much reduced if the
condition that has the smallest number of records is searched
initially, then the chosen records could be searched in the
next smaller list and so on. This technique has been referred
to as the least list principle. Another technique for
reducing processing time is to numerically order the records
for each condition and then apply the appropriate search
algorithms.
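The retrieval by conditions A, B, C and D can be sketched as an intersection of condition lists taken shortest first, which is one reading of the least list principle. The record addresses and the names `inverted` and `retrieve` below are hypothetical, and at least one condition is assumed to be given.

```python
def retrieve(inverted, conditions):
    """Return the record addresses that satisfy every condition.

    The lists are processed from the shortest to the longest, so the
    candidate set never grows beyond the shortest list.
    """
    lists = sorted((inverted[c] for c in conditions), key=len)
    result = set(lists[0])
    for addresses in lists[1:]:
        result &= set(addresses)
        if not result:          # nothing left to match, stop early
            break
    return sorted(result)

# One list of (hypothetical) record addresses per condition.
inverted = {
    "A": [1, 2, 3, 5, 8, 9],
    "B": [2, 3, 8],
    "C": [3, 8, 9, 10],
    "D": [1, 3, 8, 11, 12],
}

print(retrieve(inverted, ["A", "B", "C", "D"]))
```

Only the records in the final intersection need to be fetched from the file area; all of the matching is done in the list area.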

The use of inverted lists has two principal advantages.
The primary advantage of this approach is that complicated
file requests can be processed efficiently. Another
advantage is that processing time for updating is much reduced
when compared to chaining, since most of the updating is done
in the lists rather than in the file area. With this type of
approach, no pointer is incorporated with the data in the
record structure; thus, the reduction of record size could
somewhat balance the storage space occupied by the lists.

Based on the aforementioned arguments for chaining
and lists, the choice of an inverted list is warranted if
selective data retrieval is frequent. This question is
something the evaluator (with the guidance of the users) has
to decide. A statistical analysis should be made of the
frequency of retrieval of data elements prior to the selection
of a data structure.

To go a step further, one may take into consideration
the possibility of using an involute type of data structure.
An involute structure is a logical extension of the inverted
list. It is a structure in which the data elements replace
the pointers to records from the lists. The physical records
are replaced by linkages of chains, each element of which
belongs to a list. Each list would contain all the data
elements belonging to it, hence elements could appear in any
number of lists (up to its maximum). This structuring provides
complete flexibility in file accessing and producing reports.
However, the main arguments against this type of structuring
are its increased processing time and its increased storage
requirement to maintain the lists.
B. STORAGE STRUCTURES AND FILE MEDIA TYPES

As previously mentioned, storage structure is the actual
physical data organization in the system. Generally there
are two possible ways of storing the data: sequential or
non-sequential (random). The method of data storage is
invariably dictated by the type of access that may suitably be
selected after selecting the data structure design. Three
types of access may be sequential, indexed-sequential and
direct. Sequential access does not lend itself to random
storage of data, since the retrieval of records is based on
their pre-designed logical sequence. For indexed-sequential
access, sequential storage of data is applicable. The index
and actual records are sequentially stored. Direct access of
data for retrieval is accomplished by using an algorithm which
converts between the record key and physical address.
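The source does not name the conversion algorithm used for direct access. As one possible illustration, the sketch below uses the division-remainder method, a common key-to-address transformation chosen here as an assumption; the track count and the names `key_to_address`, `store` and `fetch` are hypothetical.

```python
TRACKS = 7  # hypothetical number of addressable tracks

def key_to_address(key: int) -> int:
    """Convert a record key to a track address (division-remainder method)."""
    return key % TRACKS

# Each track holds the records whose keys convert to its address.
storage = {track: [] for track in range(TRACKS)}

def store(key: int, data: str) -> None:
    storage[key_to_address(key)].append((key, data))

def fetch(key: int):
    """Go directly to the computed track and scan only that track."""
    for stored_key, data in storage[key_to_address(key)]:
        if stored_key == key:
            return data
    return None

store(1004, "x")
store(1011, "y")   # converts to the same track as 1004
print(key_to_address(1004), fetch(1011))
```

Two keys may convert to the same address, so the track is scanned for the exact key; no index is consulted and no other track is read.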

The following is a tabular summary of storing and accessing
of records:


TYPES OF ACCESS             TYPE OF STORAGE
                         SEQUENTIAL     RANDOM
sequential            :     Yes           No
indexed-sequential    :     Yes           No
direct                :     Yes           Yes


The selection of file media is dependent on the storage
structure. If random storage is desired, use of magnetic
tapes is not feasible. Tapes could be considered as file
media only if (1) the data are to be stored and accessed in
sequence, (2) the file is small and/or (3) the frequency of
access is small.

The storage structure to be adopted should be a result of
the users' requirements (speed, space, etc.) and cost analyses.
Table I provides a comparative cost of a sample of storage
units. It should be noted that the processor storage has the
highest cost per byte and the magnetic tape and data cell have
the lowest cost. Usually, the faster the storage unit, the
higher the cost. For each type of storage unit the cost
difference is usually directly proportional to the capacity.
III. ANALYSIS OF DATA STRUCTURING TECHNIQUES


Data organization is a term applied to the various methods
of organizing data within a file. While Dodd's term is data
organization, Lefkovitz calls it file organization [House, 1974
and Lefkovitz, 1969]. There are various schemes of organizing
data within a file. Dodd categorized all these schemes into
three basic types of organization: sequential, random, and
list.

From the preceding section four distinct types of data
structures can be related to the basic classifications of
data organization offered by Dodd. They are: multiple record
file, chaining, simple list and inverted list. Sequential data
organization, as Dodd defined it, is associated with a multiple
record file. Chaining, simple list and inverted list would be
included in Dodd's list classification of data organization.
Three data structuring methods are analyzed in this section:
multiple record file, simple list and inverted list. They are
used to describe the different data organization techniques.
It is evident from Section II that data structure is a factor
of prime importance in the evaluation of the data base
management systems.

In evaluating data base systems, various criteria have to
be considered. Different authors give different sets of
criteria. The following set of criteria is deemed comprehensive
and the most important: retrieval capability, maintenance
capability, storage requirements and accuracy. The analysis
of the three data structure methods is based on these criteria.


A. RETRIEVAL CAPABILITY

Figure 7 is a representation of the time involved with data
retrieval. Time delays can occur between the central processing
unit and the file storage, depending on the data structure type,
available memory and index size. This will be explained later.
Other factors contributing to the time duration between query
and response from the system are not considered because they
are not relevant to the analysis of the data structures. One
of these contributory factors is "handshaking," a term used to
designate the process of synchronization between the terminal
and the computer when a query is logged in.

In the analysis we are involved only with the CPU
processing time and storage units time. These are the time
elements that vary with the data base configuration. The
parameters in the time formulation are related to these
factors.² Table II is a listing of these parameters,
definitions and symbols. The file- and query-related
parameters are dependent on the data base. The device-related
parameters are dependent on the file media characteristics.

The file-related parameters are: the total number of
keys, total number of records in the system, average
number of keys per record, average length of lists and average


²The parameters and time formulation to follow are mainly
by Lefkovitz. In some instances, a different terminology is
used.


[Figure: time trace running from query input at the TERMINAL
through transmission time and transmission delay, processor
queue delay and CPU processing time at the CENTRAL PROCESSING
UNIT (CPU) and MAIN MEMORY, to the index record access queue
and the peripheral devices operations time (read), with the
variable time increments marked.]


PORTION OF DATA RETRIEVAL TIME TRACE


Figure 7





TABLE II


DEFINITIONS OF FILE PROCESSING PARAMETERS


Symbol    Definition

N_v     : Total number of keys in vocabulary

N_r     : Number of records in system

N_k     : Average number of keys per record

L       : Average list length = N_k N_r / N_v

N_t     : Average number of keys in a single query product

N_p     : Average number of non-negated terms in a single
          query product

L_s     : Average shortest list length in query

ρ       : Ratio of query response to L_s

A       : Number of file record addresses per file media
          physical record

T_a     : Random access time of media

R       : Rotation time of media (direct access only)
number of characters per record. The keys are the pointers
stored in the index or indices for the list-type of data
structure. The average of the number of keys per record is
computed over all the pointers to each record.

The query-related parameters are: average number of keys
in a single query product, average number of non-negated keys
in a single query product, average shortest list length per
query and the average ratio of query response to the average
shortest list length per query. The term "query product" as
used here refers to the record retrieved as the result of the
query. The non-negated keys are the keys not eliminated from
consideration in the process of searching record pointers
within the index.

The file media-related parameters are: number of file
record addresses per file media record, random access time of
file media, transfer rate of file media, and for direct access
storage devices (DASD), rotation time.

Excluding query interpretation time and the index directory
decoding time, the list search retrieval times are: (1) for a
simple list:

     Time = List search and record transfer time

and (2) for an inverted list:

     Time = List intersection time + List search and
            record transfer time.


Using the symbols, the formulae are: (1) for a simple list:

     T_sl = L_s (T_a + 1.5R)*


and (2) for an inverted list:

     T_il = [list intersection time] + ρ L_s (T_a + 1.5R)
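The multiplier applied to the rotation time R in these formulas is explained in the accompanying footnote as the sum of the average rotational delay (.5) and the fraction of the track occupied by a physical record. The sketch below works through that arithmetic for a single record access. The function names and the timing values are hypothetical, and the per-record form "access time plus multiplier times R" is an assumption drawn from the footnote rather than a quotation of the source formulas.

```python
def rotation_multiplier(track_fraction: float) -> float:
    """Average rotational delay (0.5) plus the fraction of a track
    occupied by one physical record."""
    return 0.5 + track_fraction

def record_access_time(access_ms: float, rotation_ms: float,
                       track_fraction: float = 1.0) -> float:
    """Time to reach and read one physical record, in milliseconds."""
    return access_ms + rotation_multiplier(track_fraction) * rotation_ms

# Hypothetical device: 75 ms random access time, 25 ms rotation time.
full_track = record_access_time(75.0, 25.0)            # record fills a track
quarter_track = record_access_time(75.0, 25.0, 0.25)   # record fills 1/4 track
print(full_track, quarter_track)
```

With a record that occupies a whole track the multiplier is 1.5, which matches the footnote's stated assumption; smaller physical records reduce the multiplier toward 0.5.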


1. CPU Processing Time


The CPU processing time varies with the data structure
type. For a multiple record filing scheme, the CPU processing
time for a query may not be significant. Upon receipt of the
query the CPU would determine which file is to be processed
and the storage unit would be accessed and reaccessed until the
desired record is found. For the list-type of data structure,
the processing of the query would be more involved.

Various factors affect this interpretation process of the
query by the CPU. One of prime importance is the index and
file partitioning technique. When partitioning the indices of
a complicated list data structure, the frequently accessed
indices should be stored in fast DASD. In this way, the
summation of indices access time is reduced. The same
consideration applies to the partitioning of the records of the
file among the file media available.

Another factor is the index size. The index must be
accessed and transferred to the main memory if it is not in
main memory. If the index is large and cannot be contained
in the available portion of the main memory, a succession of
accesses and searches of the index may have to be done before
the right pointer to the record is determined. For a simple
list and an inverted list of the same number of records, the
inverted list is likely to have a larger index size because
the number of pointers to each record is usually more than one.
In a simple list, there is only one pointer for each record.


*The multiplier 1.5 on R is arrived at on the assumption that
a physical record occupies one track. Otherwise this multiplier
would be the sum of the average track rotational delay (.5) and
the fraction of the track occupied by a physical record.
If the list involves more than one degree of inversion,
the sequence of index access and search is also important.
Usually, the shortest lists should be accessed and searched
first in order to minimize the total length of the index to
be operated on.

For a complicated list, the associativity of the records
is another factor that affects the time duration of query
processing by the CPU. If the records have minimum association,
i.e., only one or a few keys are related to each record, then
fewer indices need to be intersected. This implies that
less time would be required to retrieve a record from the data
base.

File selectivity is another factor. The file is
considered to be highly selective if each key points to only one
record in the data base. A completely inverted list, therefore,
has the highest selectivity. In this case, while the time for
list intersection would be longer, the search and access
process of the record would be shorter than for the simple list.

2. Storage Unit Operating Time

The type of file media that is used will affect the time
duration between query and response. Usually the indices and
the records of the file would be on the peripheral devices.
The choices are many: drums, discs, magnetic tapes, paper
tapes, etc.

The two main factors to be considered in selecting a file
medium are the cost of the equipment and its speed. It is
important to note that new technology is forcing reduction in
storage cost.

Table III gives a sample of DASD timing factors. For
typical magnetic tapes, random access time is one minute and
serial access is about four milliseconds. See Table I for a
sample of storage unit capacity and cost. Typical file media
in use today are:
(1) Punched Card. Slowest storage unit but inexpensive.


(2) Paper Tape. Least expensive storage unit that could be
used for direct communications to the computer.


(3) Magnetic Tape. Serially processed; inexpensive mass
storage media for large files; could be used for very fast
off-line data transmission.


(4) Small Drums. Small backing storage for high activity
data or programs; may hold indices.


(5) Small Disc Unit. Interchangeable file media; fast for
serial processing; requires seek time of about 50 to 200
milliseconds before access.


(6) Large Disc Unit. Larger capacity; access time up to
about 180 milliseconds.


(7) Large Drum. Largest fixed storage presently available,
with a capacity in the billion-character range; fast access
time of about 30 to 50 milliseconds.


(8) Magnetic Card File. Interchangeable cartridges; random
access time of about 255 milliseconds.


(9) Magnetic Strip File. Large random access file made up
of strips; access time of about 90 milliseconds or more; can
be used for fast serial processing.


TABLE III


DIRECT ACCESS STORAGE DEVICES


[Table columns: Model; DASD Type; Rotation (ms). Device types
listed: drum (model 2301); movable head disk; disk pack
(movable head); magnetic strip, cell store (data cell, movable
head), with separate entries for previous strip restored and
previous strip not restored.]


(10) Large Core Memory. Most expensive; fastest access
time; access time of about 8 microseconds [Martin, 1967].


Disc units may be either removable or nonremovable packs.
Removable packs have longer access time, lower recording
density and lower cost per byte than nonremovable packs.
Further, the former offers unlimited off-line storage capacity
while the latter's capacity is limited to the on-line storage.

Partitioning of files among the file media is not the
only technique available to the evaluator or designer of data
base systems for maximizing the speed of file operations.
Various techniques of reducing seek and access times of DASDs
have been developed. The main approaches to reduce DASDs'
operation times are the following: (1) Records that are
to be accessed in sequence are selectively stored on the
device. Record processing time after an access is computed.
After accessing a record, rotation time is allowed equivalent
to the computed processing time. In this way the next record
can be read immediately and no unnecessary rotation of DASD
is incurred. (2) Another algorithmic approach to reduce the
device operation time is to time the rotation and availability
of read heads (applicable only to multi-head DASDs). See
Ref. 14 for the above algorithms and examples.
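Approach (1) can be illustrated by spacing consecutive records around a track so that the rotation between them covers the processing time. This is a sketch of the idea only, not the algorithm of Ref. 14: the fixed-size record slots, the timing values and the names `slots_to_skip` and `layout` are assumptions.

```python
import math

def slots_to_skip(processing_ms: float, rotation_ms: float,
                  slots_per_track: int) -> int:
    """Number of record slots that pass under the head during processing."""
    slot_ms = rotation_ms / slots_per_track
    return math.ceil(processing_ms / slot_ms)

def layout(n_records: int, skip: int, slots_per_track: int):
    """Place logical records 0..n-1 on one track, leaving `skip` slots
    between consecutive records."""
    if n_records > slots_per_track:
        raise ValueError("more records than slots on the track")
    track = [None] * slots_per_track
    pos = 0
    for rec in range(n_records):
        while track[pos] is not None:        # slot already used, try the next
            pos = (pos + 1) % slots_per_track
        track[pos] = rec
        pos = (pos + 1 + skip) % slots_per_track
    return track

# Hypothetical device: 25 ms rotation, 10 slots per track, 4 ms processing.
skip = slots_to_skip(4.0, 25.0, 10)
print(skip, layout(5, skip, 10))
```

With these values two slots pass during processing, so each record is stored three slots after its predecessor and arrives under the head just as the processing of the previous record ends.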

3. Data Structure Types and Data Retrieval Time

To review the preceding discussion, we note that the
bases of the differences in data retrieval times attributable
to data structure are: the time spent by the CPU on processing
of query, index and record and the storage unit operating time.
For convenience, we would designate this portion of data
retrieval time as the data retrieval sub-time (T_sub). In
this respect, there are three data structure types that are
under consideration: multiple record file, simple list and
inverted list. The following formulations are based on a single
record retrieval.
a. Multiple Record File and Data Retrieval Sub-time

Under the multiple record file type of data structure,
once the query for a record is received and interpreted
by the CPU, the search for the record will start. No index
or indices have to be accessed and searched and intersected
for the record address. Rather, a succession of accesses and
CPU processing of records might have to be done before the
desired record is found. This CPU processing is actually the
comparison process of records. With this type of data
structure, the records should be stored sequentially, especially
if magnetic tapes are used. If records are randomly stored,
random accessing of records has to be done.

When DASDs are used, other search algorithms are
possible. One of these is the logarithmic or binary search.
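A binary search over records stored in key sequence can be sketched as follows. The record layout and the access counter are hypothetical additions made for the example; the counter stands in for the number of storage accesses and record comparisons.

```python
def binary_search(records, key):
    """Search records sorted by key; return (record or None, accesses made)."""
    lo, hi = 0, len(records) - 1
    accesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        accesses += 1                    # one record read and compared
        mid_key = records[mid][0]
        if mid_key == key:
            return records[mid], accesses
        if mid_key < key:
            lo = mid + 1                 # continue in the upper half
        else:
            hi = mid - 1                 # continue in the lower half
    return None, accesses

# Sixteen hypothetical records stored in key sequence.
records = [(k, f"rec{k}") for k in range(1000, 1016)]
found, accesses = binary_search(records, 1015)
print(found, accesses)
```

A sequential scan for the same key would read all sixteen records, whereas the binary search reads at most five, which is why it needs a device that can address any record directly.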

b. Simple List and Data Retrieval Sub-time

Data retrieval sub-time for this type of data
structure has been partially formulated in the previous
discussion. Index decoding time, however, has to be added to
the previous formulation. Sub-time would be


     T_sub = T_sl + N_t T_cpu


or, for a more complete picture of the data retrieval
time:

     T_sub = L_s (T_a + 1.5R) + N_t T_cpu


Assignment of values for T_a and R is dependent on the storage
unit average operation time (see Table III).
c. Inverted List and Data Retrieval Sub-time

Sub-time formulation for this data structure type
has also been partially done. Like the simple list, the index
decoding time has to be added. The resultant equation is


     T_sub = T_il + N_t T_cpu

or

     T_sub = [list intersection time] + ρ L_s (T_a + 1.5R)
             + N_t T_cpu


Also, like the simple list, assignment of values for T_a and R
is dependent on the storage unit operating times; sample values
are listed in Table III.

Note that the T_cpu for the list structures would be
different from T_cpu of the multiple record file. The T_cpu
times are dependent on the number of machine instructions
necessary to compare records and to decode the index.


B. MAINTENANCE CAPABILITY

In some data base systems, more time is expended on
maintaining files than on searching them. Because of this
situation, file maintenance speed and efficiency determine the
cost and/or feasibility of the entire system. It is reported
that 20 to 30 percent of total machine time is spent for the
file maintenance function of sorting in general business
applications [Meadows, 1967].

Sorting and merging are the best known as well as the most
common file maintenance functions. Another maintenance function
is the initial file creation. The file maintenance functions
can be subdivided into: (1) addition of records to a file,
(2) deletion of records from a file, (3) changing the value
of a record's field, (4) changing the record's structure, (5)
altering the record sequence in a file, and (6) changing the
file storage media. The fifth function is sorting.

This subsection will describe the capabilities of the three
basic data structures--multiple record file, simple list and
inverted list, in terms of the above functions. Simple and
inverted list will be included under list structures.

Maintenance of a Multiple Record File

The maintenance of a multiple record file is trivial
unless a sequence of records other than their arrival sequence
is required. If records are not sequenced, the new records are
normally added at the end of the file. In this case no complex
search for spaces for the records to be inserted is required.
The situation is entirely different if a record sequence is
maintained. Here oftentimes the entire file must be copied.

Deletion of records can be done in more than one manner
for this data structure type. One method of deletion is to flag
the records to be deleted as such by changing the value of one
of their fields. By this method the records to be deleted are
not physically removed from the file. Another method of
deleting records is to copy the entire file, excluding the
records to be deleted. The latter method is necessary if space
is needed for new records. The changing of a record's content
or structure is done by changing, adding or deleting fields in
memory. The fifth function (sorting) involves alteration of
record sequence. Multiple passes through the file may be
necessary to accomplish this. The last function, which involves
changing the file's medium, is a considerably simple process.
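
The two deletion methods described above can be illustrated with a
minimal Python sketch. It is not taken from the original study: the
Record class, the "deleted" flag field and the function names are
hypothetical, and an in-memory list stands in for the file.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: int
    name: str
    deleted: bool = False  # hypothetical flag field marking logical deletion


def flag_delete(records, keys):
    """Logical deletion: mark matching records; nothing is physically removed."""
    for rec in records:
        if rec.key in keys:
            rec.deleted = True


def copy_delete(records):
    """Physical deletion: copy the file, leaving out the flagged records."""
    return [rec for rec in records if not rec.deleted]


file_a = [Record(1, "LEE"), Record(2, "MACDONALD"), Record(3, "PETERSON")]

flag_delete(file_a, {2})
flagged_length = len(file_a)   # the flagged record still occupies space

compacted = copy_delete(file_a)
compacted_keys = [rec.key for rec in compacted]  # space is reclaimed only here
```

Flagging is cheap but leaves the space occupied; only the copying pass
frees space for new records, which is the tradeoff noted in the text.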
Maintenance with List Structures

Different procedures for the addition of records to a
list data structure are diagramed in Figures 8, 9, and 10.
Variations among the different procedures are dependent on the
sequencing of the file to which the new records are to be added.
The records may be arranged in sequence logically or physically
within the file. For example, in Figure 9 the new records
are logically sequenced within the main file by means of
pointers. In this figure, the insurance number and the age
pointers are used to point to the numerical sequence of
insurance numbers and age, respectively. The updating procedure
for the main file using this scheme is complicated because the
pointers must be updated. In Figure 10, the new records are
physically sequenced by hundreds of insurance numbers per track
of the disc. After placing the new records in the main file, the
records are posted to the appropriate indices. This posting
operation is a complex one. It requires either that the index
files (indices) be copied as records are being added at many
points throughout the file [Figure 8], or that chaining
techniques or list structures [Figure 10] be used to permit
noncontiguous placement of records in the index files. The
use of chaining within the index files is a complicated scheme
because the succeeding record pointers have to be accessed and
modified for every record that is added. The complexity of the
process increases as the degree of file inversion increases.
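
The pointer updating involved in adding a record to a logically
sequenced file can be sketched as follows. This is an illustrative
Python sketch only, not the procedure of Figures 8 to 10: the record
addresses, the key values and the single "next" pointer field are
assumptions made for the example.

```python
def insert_in_sequence(records, head, new_addr, new_key):
    """Store a record at new_addr and link it into the logical key sequence.

    records maps a storage address to {"key": ..., "next": ...}.
    Returns the (possibly new) address of the first record in sequence.
    """
    new_rec = {"key": new_key, "next": None}
    records[new_addr] = new_rec

    # The new record becomes the first one in the logical sequence.
    if head is None or new_key < records[head]["key"]:
        new_rec["next"] = head
        return new_addr

    # Otherwise walk the chain to find the predecessor of the new record.
    prev = head
    while (records[prev]["next"] is not None
           and records[records[prev]["next"]]["key"] <= new_key):
        prev = records[prev]["next"]

    # Two pointers change: the new record's and its predecessor's.
    new_rec["next"] = records[prev]["next"]
    records[prev]["next"] = new_addr
    return head


def keys_in_sequence(records, head):
    """Follow the pointers and list the keys in logical order."""
    keys = []
    addr = head
    while addr is not None:
        keys.append(records[addr]["key"])
        addr = records[addr]["next"]
    return keys


records = {}
head = None
# Records arrive out of key order and are stored at consecutive addresses.
for addr, key in [(1001, 300), (1002, 100), (1003, 200)]:
    head = insert_in_sequence(records, head, addr, key)

sequence = keys_in_sequence(records, head)
```

Each insertion requires a search for the predecessor and a pointer
update; with an inverted list the same work would be repeated for every
chained attribute.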

The main advantage attributed to list files, with regard
to addition of records, is that the logical sequence of records
is continuously maintained. The copying of files, which is
often necessary for multiple record files, is avoided. At
most, only the index files have to be copied. For a simple
list there is only one index. However, for inverted lists the
number of indices that may need to be copied can be large.

The process of record deletion with list files is also
complex. If chaining is used within the main file or the index
files, a complicated process of updating the pointers would be
incurred. A further problem involves the accounting for spaces
available to the records within the entire file. This would
increase the processing time and grab spaces from the addition
and deletion process during execution.

The process of changing the record values would be a
simple and fast process with list files. Pointers could be
followed to the required records. The only factor that could
cause prolonged search time for the records is pointers that
span a large portion of the file. In such cases, numerous
transfers of records between main and auxiliary memory would
occur.
Record sequence change is the most difficult maintenance
function of list files. Many accesses to records and
alterations of pointers may be necessary for relatively few
changes in sequence. The last function of changing the file
medium is a simple process; it involves copying the file from
one medium to another.


C. STORAGE REQUIREMENTS

Another criterion is storage requirements. Auxiliary
storage requirements are analyzed in this subsection.

Auxiliary Storage Requirements of Data Structures

There are three basic methods of storing records on
auxiliary devices. They are: sequential records, indexed
records, and chained records. In sequential storage, record
relationships are based on record adjacencies. In indexed
record storage, indices are used to relate records to one
another. In chained record storage, related records are chained
together.

Figure 11 depicts sequential record storage. Figure
11(a) is a sequential storage of fixed length records. Note
that 13 bytes of space are wasted. Figure 11(b) is a sequential
storage of variable length records. The total space
required is 37 bytes, a saving of 8 bytes. The difference
would be great as the number of records increases. The
situation could be the opposite though. This would occur if the
records could be made of the same length. For example, if
dates can be represented as numbers in records, say "010275"
for January 2, 1975 and "050475" for May 4, 1975, then it is
better to use the fixed length records.
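
The byte counts for Figure 11 can be checked with a short Python
sketch. The one-byte length prefix per variable length record is an
assumption read off Figure 11(b), and the function names are
illustrative only.

```python
NAMES = ["LEE", "MACDONALD", "PETERSON", "BELL", "DAVIDSON"]
FIXED_LENGTH = 9     # bytes per record in Figure 11(a)
LENGTH_PREFIX = 1    # assumed: one byte holding each record's length


def fixed_storage(values, record_length):
    """Total bytes when every record is padded to record_length."""
    return len(values) * record_length


def variable_storage(values, prefix):
    """Total bytes when each record carries a length prefix and no padding."""
    return sum(prefix + len(value) for value in values)


fixed_total = fixed_storage(NAMES, FIXED_LENGTH)        # 5 records x 9 bytes
variable_total = variable_storage(NAMES, LENGTH_PREFIX)
saving = fixed_total - variable_total
padding = fixed_total - sum(len(name) for name in NAMES)

# The opposite case: values that all have the same length, such as dates.
DATES = ["010275", "050475"]
fixed_dates = fixed_storage(DATES, 6)
variable_dates = variable_storage(DATES, LENGTH_PREFIX)
```

For the date values the length prefix is pure overhead, so the fixed
length layout is the smaller one.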





Record Number        1           2           3           4           5

                | LEE       | MACDONALD | PETERSON  | BELL      | DAVIDSON  |

(a) Sequential Storage of Fixed-Length Records
(all records are 9 bytes long)


Record Number      1           2               3            4            5

                | 3 | LEE | 9 | MACDONALD | 8 | PETERSON | 4 | BELL | 8 | DAVIDSON |

(b) Sequential Storage of Variable Length Records


SEQUENTIAL STORAGE OF RECORDS


Figure 11
Figure 12 shows diagrams of indexed record storing
schemes. Figure 12(a) is a single index record storage scheme.
Figure 12(b) is a multiple index record storage scheme. Note
that in Figures 12(a) and 12(b) the use of fixed length records
would add to the space requirements. Compared to the sequential
storage of variable length records, additional space is required
for the indices. The additional space required would increase
with the degree of inversion. Possibly the index tables
may exceed the size of the data. This is the price one must pay
for the capability to relate records using list files.

Chained record storage requires more space than sequential
record storage. Added space is required for the record
pointers. Compared to indexed record storage, this scheme
requires no space for an index. Figure 13 depicts chained record
storage for single and multiple chains. In this figure the
pointers are used to follow the numerical sequence of the
attributes of social security number and age.
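
Chained record storage with two chains can be sketched in Python as
below. The addresses, the pointer field names and all but the
MACDONALD record are hypothetical values chosen for illustration; they
are not the contents of Figure 13.

```python
# Each record is stored at an address and carries one pointer per chain.
# A pointer value of None marks the end of that chain.
records = {
    1001: {"name": "MACDONALD", "age": 26, "ssn": "123456789",
           "age_next": 1002, "ssn_next": 1003},
    1002: {"name": "LEE", "age": 41, "ssn": "987654321",
           "age_next": None, "ssn_next": None},
    1003: {"name": "BELL", "age": 19, "ssn": "555443333",
           "age_next": 1001, "ssn_next": 1002},
}
AGE_HEAD = 1003   # address of the record with the lowest age
SSN_HEAD = 1001   # address of the record with the lowest SS number


def follow_chain(records, head, pointer_field):
    """Return the record names in the order given by one pointer chain."""
    names = []
    addr = head
    while addr is not None:
        names.append(records[addr]["name"])
        addr = records[addr][pointer_field]
    return names


by_age = follow_chain(records, AGE_HEAD, "age_next")
by_ssn = follow_chain(records, SSN_HEAD, "ssn_next")

# Space cost of chaining: one pointer per record per chain, and no index.
pointer_count = len(records) * 2
```

The same stored records yield two different logical sequences, at the
cost of one extra pointer field per record for each chain.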


D. ACCURACY OF DATA RETRIEVAL

Accuracy can be defined as the degree of data retrieval
correctness subject to some degree of effort exerted. This
data retrieval accuracy concept can be subdivided into what
is commonly known as the recall and precision ratios. The
recall ratio is defined as the degree of success in retrieving
relevant data from a relevant system. Recall ratio is the
system's capability to let through the desired data. The
precision ratio is the system's capability to hold back unwanted
data. Quantitatively, these ratios could be expressed as:
(a) Single Index Record Storing (simple list)
(b) Multiple Index Record Storing (inverted list)


INDEXED RECORD STORING


Figure 12
(a) Single-Chain Records Storing (record structure with an age pointer)
(b) Multi-Chain Records Storing (record structure with an age pointer
and a social security number pointer)


CHAINED RECORD STORAGE


Figure 13
$$\text{Recall Ratio} = \frac{\text{Number of Relevant Data Items Retrieved by the System}}{\text{Total Number of Relevant Data Items in the System}} \times 100$$

$$\text{Precision Ratio} = \frac{\text{Number of Relevant Data Items Retrieved by the System}}{\text{Total Number of Data Items Retrieved by the System}} \times 100$$

To fully understand these ratios, the following example is
considered:

Suppose there are 30 relevant records in an inquiry
to a data base. A system search is conducted and 25 of these
relevant records are retrieved. The recall ratio is 25/30, or
about 83%. Assume that another 20 irrelevant records have been
retrieved with the 25 relevant ones in the process. The total
retrieved records are 45, hence the precision ratio is 25/45,
or about 56%. It is usually stated that the search has an 83%
recall at a precision of 56%. It has been observed that as the
recall ratio increases the precision ratio decreases. A typical
plot of recall ratio versus precision ratio is shown on the next
page [Lancaster and Fayen, 1973].
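
The arithmetic of this example can be reproduced with a few lines of
Python. The function and variable names are illustrative and are not
part of the original study.

```python
def recall_ratio(relevant_retrieved, relevant_in_system):
    """Percentage of the relevant items in the system that were retrieved."""
    return 100.0 * relevant_retrieved / relevant_in_system


def precision_ratio(relevant_retrieved, total_retrieved):
    """Percentage of the retrieved items that are relevant."""
    return 100.0 * relevant_retrieved / total_retrieved


relevant_in_system = 30
relevant_retrieved = 25
irrelevant_retrieved = 20
total_retrieved = relevant_retrieved + irrelevant_retrieved

recall = recall_ratio(relevant_retrieved, relevant_in_system)
precision = precision_ratio(relevant_retrieved, total_retrieved)

print(f"recall = {recall:.0f}%, precision = {precision:.0f}%")
```

Retrieving more records can only raise or hold the recall figure, but
each irrelevant record retrieved enlarges the denominator of the
precision ratio, which is one way to see the tradeoff noted above.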

Data structure type is certainly a relevant factor in
determining a system's data retrieval accuracy. Unfortunately,
it is not the only factor that affects the system's accuracy.
Size of data base and volatility of data are other factors. Due
to the multiplicity of factors, it is difficult to determine the
accuracy of retrieval. One may approximate the system's accuracy
by simulation. Simulation should, however, be the last recourse.
E. TRANSIENCY

Transiency of a data base is its tendency to be disorganized.
The more transient a data base, the more time and processing
would be required for its reorganization. Data structure type
affects the system's transiency. Logically, transiency would
increase with the degree of inversion. Transiency would also
increase with the degree of chaining utilized.
IV. SUMMARY


In data base system selection, one may hope for but should
hardly expect a perfect system. The circumstances impacting
on the evaluator or designer would considerably affect the
selection decision, e.g., financial and political
considerations. Another factor is the possible contradiction of
user requirements. A case in point is a requirement for a data
base system with a powerful search capability with minimum
storage space. Finally, technological progress is a factor. As
one completes a large data base system design, developments may
occur which could be utilized in the system design.

Various topics regarding data base systems have been
discussed in this paper. They involve current practices in this
field. Possible tradeoff considerations have been emphasized
since they are important. The central topic in this paper is
data structures. This subject area has been emphasized because
it is important with respect to the selection of storage
structures, file media and file maintenance functions. This
implies that one can judiciously select data base systems by
placing great emphasis on data structure considerations.

Four categories of data base criteria have been analyzed:
retrieval capability, maintenance capability, storage
requirements and accuracy of data retrieval. It is by no means
claimed that these criteria are complete. Using these criteria,
however, could provide some assurance of the correctness of the
data base evaluation.